16 research outputs found

    Improving toponym recognition accuracy of historical topographic maps

    WordCrowd : a location-based application to explore the city based on geo-social media and semantics

    WordCrowd is a dynamic location-based service that visualizes and analyzes geolocated social media data. By spatially clustering the data, areas of interest and their descriptions can be extracted and compared on different geographical scales. When the user walks through the city, the application visualizes the nearest areas of interest and presents them in a word cloud. By aggregating the data based on the country of origin of the original poster, we discover differences and similarities in tourist interest between different countries. This work is part of the project Eureca (European Region Enrichment in City Archives and Collections) of Ghent University (IDLab, CartoGIS), the Technical University of Vienna (Research Group Cartography) and several city and state archives from Ghent and Vienna.

    Automatic Georeferencing of Topographic Raster Maps

    In recent years, many scientific institutions have digitized their collections, which often include a large variety of topographic raster maps. These raster maps provide accurate (historical) geographical information but cannot be integrated directly into a geographical information system (GIS) due to a lack of metadata. Additionally, the text labels on the maps are usually not annotated, making it inefficient to query for specific toponyms. Manually georeferencing these maps and annotating their text labels is not cost-effective for large collections. This work presents a fully automated georeferencing approach based on a text recognition and geocoding pipeline. After the text on the maps was recognized, publicly available geocoders were used to determine a region of interest. The approach was validated on a collection of historical and contemporary topographic maps. We show that this approach can geolocate the topographic maps fairly accurately, resulting in an average georeferencing error of only 316 m (1.67%) and 287 m (0.90%) for 16 historical maps and 9 contemporary maps spanning 19 km and 32 km, respectively (scale 1:25,000 and 1:50,000). Furthermore, this approach allows the maps to be queried based on the recognized visible text and found toponyms, which further improves the accessibility and quality of the collection.

    Species detection and segmentation of multi-specimen historical herbaria

    Historically, herbarium specimens have provided users with documented occurrences of plants in specific locations over time. Herbarium collections have therefore been the basis of systematic botany for centuries (Younis et al. 2020). According to the latest summary report based on the data from Index Herbariorum, there are around 3400 active herbaria in the world, containing 397 million specimens spread across 182 countries (Thiers 2021). Exponential growth in high-quality image-capturing devices, induced by the enormous number of uncovered collections, has further led to rising interest in large-scale digitization initiatives across the world (Le Bras et al. 2017). As herbarium specimens are increasingly digitised and made accessible in online repositories, an important need has also emerged to develop automated tools to process and enrich these collections and so facilitate better access to the preserved archives. This rising number of digitised herbarium sheets provides an opportunity to employ computer-based image processing techniques, such as deep learning, to automatically identify species and higher taxa (Carranza-Rojas and Joly 2018, Carranza-Rojas et al. 2017, Younis et al. 2020) or to extract other useful information from the herbarium sheets, such as handwritten text, color bars, scales and barcodes. The species identification task works well for herbarium sheets that have only one species on a page. However, there are many herbarium books that have multiple species on the same page (as shown in Fig. 1), for which the complexity of the identification problem increases tremendously. Enriching them manually also involves a great deal of time and effort. In this work, we propose a pipeline that can automatically detect, identify, and enrich plant species in multi-specimen herbaria.
    The core idea of the pipeline is to detect unique plant species and the handwritten text around them, and to map the text to the correct plant species. As shown in Fig. 2, the proposed pipeline begins with the pre-processing of the images. The images are rotated and aligned such that the longest edge is maintained as the height. In the case of herbarium books, the pages are detected and morphological transformations are performed to reduce occlusions (Thirukokaranam Chandrasekar and Verstockt 2020). A YOLOv3 (You Only Look Once version 3) object detection model (Zhao and Li 2020) is trained from scratch to detect plants and text. The model was trained on a dataset of single-species herbarium sheets with a mosaic augmentation technique to extend the plant model to detect multiple species. The first training results are impressive, although they could be further improved with more labelled data. We also plan to train an object segmentation model and contrast its performance with the plant detection model for multi-specimen herbarium sheets. After detecting both the plants and the text, the text will be recognized with a state-of-the-art handwritten text recognition (HTR) model. The recognized text can then be matched with a database of specimens to identify each detected specimen. Furthermore, additional textual metadata (e.g. date, locality, collector's name, institution) visible on the sheet will be recognized and used to enrich the collection.

    Automatic extraction of specimens from multi-specimen herbaria

    Since herbarium specimens are increasingly becoming digitized and accessible in online repositories, an important need has emerged to develop automated tools to process and enrich these collections to facilitate better access to the preserved archives. In particular, automatic enrichment of multi-specimen herbarium sheets poses unique challenges and problems that have not been adequately addressed. The complexity of localizing species on a page increases exponentially when multiple specimens are present on the same page. This already challenges the performance of models that work accurately with single specimens. Therefore, in this work, we have performed experiments to identify the models that perform well for the plant specimen localization problem. The major bottleneck for performing such experiments was the lack of labeled data. We also address this problem by proposing tools and algorithms to semi-automatically generate annotations for herbarium images. Based on our experiments, segmentation models perform much better than detection models for the task of plant localization. Our binary segmentation model can accurately extract specimens from the background and achieves an F1 score of 0.977. The ablation experiments for multi-specimen instance segmentation show that our proposed augmentation method provides a 38% increase in performance (0.51 [email protected] versus 0.37) on a dataset of 1,500 plant instances.

    NewspAIper : AI-based metadata enrichment of historical newspaper collections

    This paper presents the NewspAIper demonstrator, which facilitates article-level search and cross-collection linkage and shows how to improve the searchability of digitized cultural heritage collections.